Many colleges want to maximize the money they receive from their alumni. To do so, they need to predict the salary and unemployment rate of recent graduates from their education and other factors. With those predictions, colleges can put more money into the programs whose graduates (their investments) yield the largest return.
Business Question:
Where should colleges put money in order to maximize the amount they receive from recent graduates?
Analysis Question:
Based on the characteristics and education of recent graduates, what is their predicted median salary? Is it above $50,000, or at or below it?
The data come from the 2010-2012 American Community Survey Public Use Microdata Series and are limited to respondents under the age of 28. The general purpose of this code and data is based upon this story.
What will we be doing? Methods, techniques, why?
A brief look at the raw data can be found below.
## 'data.frame': 172 obs. of 21 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Major_code : int 2419 2416 2415 2417 2405 2418 6202 5001 2414 2408 ...
## $ Major : chr "PETROLEUM ENGINEERING" "MINING AND MINERAL ENGINEERING" "METALLURGICAL ENGINEERING" "NAVAL ARCHITECTURE AND MARINE ENGINEERING" ...
## $ Total : int 2339 756 856 1258 32260 2573 3777 1792 91227 81527 ...
## $ Men : int 2057 679 725 1123 21239 2200 2110 832 80320 65511 ...
## $ Women : int 282 77 131 135 11021 373 1667 960 10907 16016 ...
## $ Major_category : chr "Engineering" "Engineering" "Engineering" "Engineering" ...
## $ ShareWomen : num 0.121 0.102 0.153 0.107 0.342 ...
## $ Sample_size : int 36 7 3 16 289 17 51 10 1029 631 ...
## $ Employed : int 1976 640 648 758 25694 1857 2912 1526 76442 61928 ...
## $ Full_time : int 1849 556 558 1069 23170 2038 2924 1085 71298 55450 ...
## $ Part_time : int 270 170 133 150 5180 264 296 553 13101 12695 ...
## $ Full_time_year_round: int 1207 388 340 692 16697 1449 2482 827 54639 41413 ...
## $ Unemployed : int 37 85 16 40 1672 400 308 33 4650 3895 ...
## $ Unemployment_rate : num 0.0184 0.1172 0.0241 0.0501 0.0611 ...
## $ Median : int 110000 75000 73000 70000 65000 65000 62000 62000 60000 60000 ...
## $ P25th : int 95000 55000 50000 43000 50000 50000 53000 31500 48000 45000 ...
## $ P75th : int 125000 90000 105000 80000 75000 102000 72000 109000 70000 72000 ...
## $ College_jobs : int 1534 350 456 529 18314 1142 1768 972 52844 45829 ...
## $ Non_college_jobs : int 364 257 176 102 4440 657 314 500 16384 10874 ...
## $ Low_wage_jobs : int 193 50 0 0 972 244 259 220 3253 3170 ...
## - attr(*, "na.action")= 'omit' Named int 22
## ..- attr(*, "names")= chr "22"
As can be seen above, most of the variables are integers. Several of them can be used to derive factor variables that sit alongside the numerical ones. In addition, the variables Rank, Major_code, and Major can be dropped: Rank is highly correlated with the salary variable, and the other two are too specific to generalize.
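library(dplyr)
# Add the two labels, then drop Rank, Major_code, and Major (columns 1-3).
# With a 0.5 cutoff, every major is labelled "Low".
majors_added_categorical <- majors_raw %>%
  mutate(
    Over.50K = ifelse(Median > 50000, "Over", "Under.Equal"),
    High.Unemployment = ifelse(Unemployment_rate > 0.5, "High", "Low")
  ) %>%
  select(-1, -2, -3)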
In addition, the levels of the categorical variable Major_category can be collapsed into four broader groups, which gives the analysis more observations per group.
##
## Sciences Arts Other STEM
## 54 30 48 40
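The collapsing step can be sketched in base R with a lookup vector. The mapping below is hypothetical, since the grouping actually used is not shown here; only the four target levels (Sciences, Arts, Other, STEM) come from the output above. The names `category_map` and `collapse_category` are ours.

```r
# Hypothetical lookup from original category to collapsed group.
# The real assignment used in the analysis is not shown in this report.
category_map <- c(
  "Engineering"               = "STEM",
  "Computers & Mathematics"   = "STEM",
  "Biology & Life Science"    = "Sciences",
  "Physical Sciences"         = "Sciences",
  "Arts"                      = "Arts",
  "Humanities & Liberal Arts" = "Arts",
  "Business"                  = "Other",
  "Education"                 = "Other"
)

# Map each value through the lookup; anything unmapped falls into `default`.
collapse_category <- function(x, map, default = "Other") {
  out <- unname(map[as.character(x)])
  out[is.na(out)] <- default
  factor(out, levels = c("Sciences", "Arts", "Other", "STEM"))
}

example <- c("Engineering", "Arts", "Business",
             "Physical Sciences", "Law & Public Policy")
collapsed <- collapse_category(example, category_map)
table(collapsed)
```

Sending unmapped categories to a default group keeps the function from producing NA values if a category is missing from the lookup.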
Some of the analysis requires every categorical variable to be one-hot encoded, which is done below:
library(data.table)
library(mltools)  # provides one_hot()
# One-hot encoded data
majors_onehot <- one_hot(data.table(majors_factors), cols = c("Major_category", "High.Unemployment"))
# Original (factor) data
majors <- majors_factors
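To make clear what one-hot encoding produces, here is a minimal base R sketch on a toy data frame. It uses `model.matrix` rather than `one_hot`, and the four rows are illustrative values, not the real data.

```r
# Toy data: one factor column and one numeric column (illustrative values).
toy <- data.frame(
  Major_category = factor(c("STEM", "Sciences", "Arts", "STEM"),
                          levels = c("Sciences", "Arts", "Other", "STEM")),
  ShareWomen = c(0.121, 0.536, 0.700, 0.102)
)

# "~ 0 + factor" drops the intercept, so every level gets its own 0/1 column.
indicators <- model.matrix(~ 0 + Major_category, data = toy)

# Rename columns to the Major_category_<level> pattern seen in the report.
colnames(indicators) <- sub("^Major_category", "Major_category_",
                            colnames(indicators))

encoded <- cbind(toy["ShareWomen"], as.data.frame(indicators))
encoded
```

Each row has exactly one 1 across the indicator columns, and a level with no rows (here "Other") still gets a column of zeros.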
Before starting the modeling, it helps to summarize and visualize the data, both to understand it as a whole and to look closely at the variables expected to matter most for the analysis.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22000 33000 36000 40077 45000 110000
## Total Men Women ShareWomen Sample_size Employed
## Total 1.0000000 0.8780884 0.9447645 0.1429993 0.9455747 0.9962140
## Men 0.8780884 1.0000000 0.6727589 -0.1120136 0.8751756 0.8706047
## Women 0.9447645 0.6727589 1.0000000 0.2978321 0.8626064 0.9440365
## ShareWomen 0.1429993 -0.1120136 0.2978321 1.0000000 0.0974957 0.1475468
## Sample_size 0.9455747 0.8751756 0.8626064 0.0974957 1.0000000 0.9644062
## Full_time Part_time Full_time_year_round Unemployed
## Total 0.9893392 0.9502684 0.9811118 0.9747684
## Men 0.8935631 0.7515917 0.8924540 0.8694115
## Women 0.9176812 0.9545133 0.9057195 0.9116943
## ShareWomen 0.1202001 0.2122898 0.1125230 0.1212430
## Sample_size 0.9783624 0.8245444 0.9852125 0.9179335
## Unemployment_rate Median P25th P75th College_jobs
## Total 0.08319170 -0.1067377 -0.07192608 -0.08319767 0.8004648
## Men 0.10150234 0.0259906 0.03872518 0.05239290 0.5631684
## Women 0.05910776 -0.1828419 -0.13773826 -0.16452834 0.8519460
## ShareWomen 0.07320458 -0.6186898 -0.50019863 -0.58693216 0.1955501
## Sample_size 0.06295494 -0.0644750 -0.02442859 -0.05225614 0.7012309
## Non_college_jobs Low_wage_jobs
## Total 0.9412471 0.9355096
## Men 0.8514998 0.7913360
## Women 0.8721318 0.9044699
## ShareWomen 0.1370066 0.1878496
## Sample_size 0.9153352 0.8601159
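Many of the count variables above are almost perfectly correlated with one another. A quick way to list such pairs is sketched below. It uses only the first ten rows of four columns as printed in the raw data preview, so the correlations it prints will differ from the full-data values above; the 0.9 cutoff is an arbitrary choice of ours.

```r
# First ten rows of four columns, copied from the raw data preview.
first_rows <- data.frame(
  Total  = c(2339, 756, 856, 1258, 32260, 2573, 3777, 1792, 91227, 81527),
  Men    = c(2057, 679, 725, 1123, 21239, 2200, 2110, 832, 80320, 65511),
  Women  = c(282, 77, 131, 135, 11021, 373, 1667, 960, 10907, 16016),
  Median = c(110000, 75000, 73000, 70000, 65000, 65000, 62000, 62000,
             60000, 60000)
)

cors <- cor(first_rows)

# Keep each pair once (upper triangle) when |r| exceeds the cutoff.
high <- which(abs(cors) > 0.9 & upper.tri(cors), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]],
  r    = cors[high]
)
high_pairs
```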
## [1] 172 22
## [1] 121 22
## [1] 26 22
## [1] 25 22
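The four dimensions above are consistent with a 70% training set and an even split of the remaining rows into tuning and test sets. The splitting code itself is not shown, so the following is only a sketch under that assumption: it uses a plain random split in base R, and the seed and object names are ours.

```r
set.seed(1984)  # arbitrary seed, for reproducibility of the sketch only

n <- 172
shuffled <- sample(n)

n_train <- ceiling(0.7 * n)              # 121 rows
n_tune  <- ceiling(0.5 * (n - n_train))  # 26 rows

train_idx <- shuffled[seq_len(n_train)]
tune_idx  <- shuffled[n_train + seq_len(n_tune)]
test_idx  <- shuffled[(n_train + n_tune + 1):n]  # remaining 25 rows

c(train = length(train_idx), tune = length(tune_idx), test = length(test_idx))
```

A stratified split on Over.50K would keep the class balance similar across the three sets, which matters here because the "Over" class is small.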
## Classes 'data.table' and 'data.frame': 121 obs. of 21 variables:
## $ Total : int 2339 756 856 2573 1792 81527 41542 14955 4321 8925 ...
## $ Men : int 2057 679 725 2200 832 65511 33258 8407 3526 6062 ...
## $ Women : int 282 77 131 373 960 16016 8284 6548 795 2863 ...
## $ Major_category_Sciences: int 0 0 0 0 1 0 0 0 0 0 ...
## $ Major_category_Arts : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Major_category_Other : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Major_category_STEM : int 1 1 1 1 0 1 1 1 1 1 ...
## $ ShareWomen : num 0.121 0.102 0.153 0.145 0.536 ...
## $ Sample_size : int 36 7 3 17 10 631 399 79 30 55 ...
## $ Employed : int 1976 640 648 1857 1526 61928 32506 10047 3608 6170 ...
## $ Full_time : int 1849 556 558 2038 1085 55450 30315 9017 2999 5455 ...
## $ Part_time : int 270 170 133 264 553 12695 5146 2694 811 1983 ...
## $ Full_time_year_round : int 1207 388 340 1449 827 41413 23621 5986 2004 3413 ...
## $ Unemployed : int 37 85 16 400 33 3895 2275 1019 23 589 ...
## $ Unemployment_rate : num 0.0184 0.1172 0.0241 0.1772 0.0212 ...
## $ P25th : int 95000 55000 50000 50000 31500 45000 45000 36000 25000 40000 ...
## $ P75th : int 125000 90000 105000 102000 109000 72000 75000 70000 74000 76000 ...
## $ College_jobs : int 1534 350 456 1142 972 45829 23694 6439 2439 3603 ...
## $ Non_college_jobs : int 364 257 176 657 500 10874 5721 2471 947 1595 ...
## $ Low_wage_jobs : int 193 50 0 244 220 3170 980 789 263 524 ...
## $ High.Unemployment_Low : int 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
## C5.0
##
## 121 samples
## 21 predictor
## 2 classes: 'Over', 'Under.Equal'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 109, 109, 108, 110, 109, 110, ...
## Resampling results across tuning parameters:
##
## model winnow trials Accuracy Kappa
## rules FALSE 1 0.9063986 0.5958991
## rules FALSE 10 0.9128089 0.6231278
## rules FALSE 20 0.9143473 0.6326592
## rules TRUE 1 0.9234149 0.6373265
## rules TRUE 10 0.9200816 0.6287551
## rules TRUE 20 0.9184149 0.6212551
## tree FALSE 1 0.9047319 0.6025318
## tree FALSE 10 0.9128089 0.6283924
## tree FALSE 20 0.9140909 0.6399963
## tree TRUE 1 0.9217483 0.6352432
## tree TRUE 10 0.9147786 0.6193285
## tree TRUE 20 0.9147786 0.6193285
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 1, model = rules and winnow
## = TRUE.
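The resampling line above, "Cross-Validated (10 fold, repeated 5 times)", means each candidate model is fit 50 times, each time on roughly nine tenths of the 121 training rows. A base R sketch of how such resamples can be built follows. It uses simple random folds, whereas the training sizes of 110 in the output suggest the folds there were balanced by class, so the sizes will not match exactly.

```r
set.seed(42)  # arbitrary seed

n <- 121
k <- 10
repeats <- 5

resamples <- list()
for (r in seq_len(repeats)) {
  # Assign every row to one of k folds, in random order.
  fold_id <- sample(rep(seq_len(k), length.out = n))
  for (f in seq_len(k)) {
    # Training rows are all rows outside the held-out fold.
    resamples[[sprintf("Fold%02d.Rep%d", f, r)]] <- which(fold_id != f)
  }
}

train_sizes <- lengths(resamples)
table(train_sizes)
```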
## Confusion Matrix and Statistics
##
## Actual
## Prediction Over Under.Equal
## Over 3 0
## Under.Equal 1 22
##
## Accuracy : 0.9615
## 95% CI : (0.8036, 0.999)
## No Information Rate : 0.8462
## P-Value [Acc > NIR] : 0.07441
##
## Kappa : 0.8354
##
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.7500
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9565
## Prevalence : 0.1538
## Detection Rate : 0.1154
## Detection Prevalence : 0.1154
## Balanced Accuracy : 0.8750
##
## 'Positive' Class : Over
##
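The statistics in the confusion matrix output can be recomputed by hand from the four cell counts, which helps when reading them. The sketch below takes the counts from the matrix above, with "Over" as the positive class.

```r
# Rows are predictions, columns are actual classes (filled column by column).
cm <- matrix(c(3, 1, 0, 22), nrow = 2,
             dimnames = list(Prediction = c("Over", "Under.Equal"),
                             Actual     = c("Over", "Under.Equal")))
n <- sum(cm)

accuracy    <- sum(diag(cm)) / n
sensitivity <- cm["Over", "Over"] / sum(cm[, "Over"])
specificity <- cm["Under.Equal", "Under.Equal"] / sum(cm[, "Under.Equal"])
no_info     <- max(colSums(cm)) / n  # accuracy of always guessing the majority

# Cohen's kappa: agreement beyond what the marginal totals alone would give.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, no_info = no_info, kappa = kappa), 4)
```

With only four "Over" majors in this set, a single misclassification moves sensitivity by 25 points, which is why the confidence interval on accuracy is so wide.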
# Given certain values for the other variables, predict the median salary
## C5.0 variable importance
##
## only 20 most important variables shown (out of 21)
##
## Overall
## P75th 100
## Unemployed 0
## Low_wage_jobs 0
## ShareWomen 0
## Unemployment_rate 0
## Sample_size 0
## High.Unemployment_Low 0
## Major_category_STEM 0
## Full_time_year_round 0
## Total 0
## Men 0
## Employed 0
## Non_college_jobs 0
## Part_time 0
## Major_category_Arts 0
## P25th 0
## College_jobs 0
## Major_category_Sciences 0
## Women 0
## Major_category_Other 0
## C5.0
##
## 121 samples
## 21 predictor
## 2 classes: 'Over', 'Under.Equal'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 109, 109, 108, 110, 109, 110, ...
## Resampling results across tuning parameters:
##
## model winnow trials Accuracy Kappa
## rules FALSE 20 0.9143473 0.6326592
## rules FALSE 30 0.9142191 0.6272642
## rules FALSE 40 0.9158858 0.6379460
## rules TRUE 20 0.9184149 0.6212551
## rules TRUE 30 0.9184149 0.6212551
## rules TRUE 40 0.9184149 0.6212551
## tree FALSE 20 0.9140909 0.6399963
## tree FALSE 30 0.9158858 0.6404803
## tree FALSE 40 0.9158858 0.6404803
## tree TRUE 20 0.9147786 0.6193285
## tree TRUE 30 0.9147786 0.6193285
## tree TRUE 40 0.9147786 0.6193285
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 20, model = rules and
## winnow = TRUE.
## C5.0
##
## 121 samples
## 21 predictor
## 2 classes: 'Over', 'Under.Equal'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 109, 109, 108, 110, 109, 110, ...
## Resampling results across tuning parameters:
##
## model winnow trials Accuracy Kappa
## rules FALSE 1 0.9063986 0.5958991
## rules FALSE 10 0.9128089 0.6231278
## rules FALSE 20 0.9143473 0.6326592
## rules TRUE 1 0.9234149 0.6373265
## rules TRUE 10 0.9200816 0.6287551
## rules TRUE 20 0.9184149 0.6212551
## tree FALSE 1 0.9047319 0.6025318
## tree FALSE 10 0.9128089 0.6283924
## tree FALSE 20 0.9140909 0.6399963
## tree TRUE 1 0.9217483 0.6352432
## tree TRUE 10 0.9147786 0.6193285
## tree TRUE 20 0.9147786 0.6193285
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 1, model = rules and winnow
## = TRUE.
## Confusion Matrix and Statistics
##
## Actual
## Prediction Over Under.Equal
## Over 4 0
## Under.Equal 0 22
##
## Accuracy : 1
## 95% CI : (0.8677, 1)
## No Information Rate : 0.8462
## P-Value [Acc > NIR] : 0.01299
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.1538
## Detection Rate : 0.1538
## Detection Prevalence : 0.1538
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : Over
##
## Confusion Matrix and Statistics
##
## Actual
## Prediction Over Under.Equal
## Over 2 3
## Under.Equal 1 19
##
## Accuracy : 0.84
## 95% CI : (0.6392, 0.9546)
## No Information Rate : 0.88
## P-Value [Acc > NIR] : 0.8266
##
## Kappa : 0.4118
##
## Mcnemar's Test P-Value : 0.6171
##
## Sensitivity : 0.6667
## Specificity : 0.8636
## Pos Pred Value : 0.4000
## Neg Pred Value : 0.9500
## Prevalence : 0.1200
## Detection Rate : 0.0800
## Detection Prevalence : 0.2000
## Balanced Accuracy : 0.7652
##
## 'Positive' Class : Over
##
## [1] 0.3953488
##
## LE.EQ.20K G.50K
## 104 68
## [1] 121 21
## [1] 25 21
## [1] 26 21
## [1] 4.472136
## X1.nrow.combined_RF.err.rate. OOB LE.EQ.20K G.50K
## 1 1 0.2982456 0.3666667 0.2222222
## 2 2 0.2325581 0.2549020 0.2000000
## 3 3 0.2475248 0.2372881 0.2619048
## 4 4 0.2110092 0.1904762 0.2391304
## 5 5 0.2280702 0.1617647 0.3260870
## 6 6 0.2288136 0.1830986 0.2978723
## 7 7 0.2000000 0.1506849 0.2765957
## 8 8 0.2250000 0.1917808 0.2765957
## 9 9 0.1735537 0.1095890 0.2708333
## 10 10 0.1900826 0.1232877 0.2916667
## 'data.frame': 121 obs. of 21 variables:
## $ Total : int 2339 756 1258 32260 3777 1792 91227 81527 15058 14955 ...
## $ Men : int 2057 679 1123 21239 2110 832 80320 65511 12953 8407 ...
## $ Women : int 282 77 135 11021 1667 960 10907 16016 2105 6548 ...
## $ Major_category : Factor w/ 4 levels "Sciences","Arts",..: 4 4 4 4 3 1 4 4 4 4 ...
## $ ShareWomen : num 0.121 0.102 0.107 0.342 0.441 ...
## $ Sample_size : int 36 7 16 289 51 10 1029 631 147 79 ...
## $ Employed : int 1976 640 758 25694 2912 1526 76442 61928 11391 10047 ...
## $ Full_time : int 1849 556 1069 23170 2924 1085 71298 55450 11106 9017 ...
## $ Part_time : int 270 170 150 5180 296 553 13101 12695 2724 2694 ...
## $ Full_time_year_round: int 1207 388 692 16697 2482 827 54639 41413 8790 5986 ...
## $ Unemployed : int 37 85 40 1672 308 33 4650 3895 794 1019 ...
## $ Unemployment_rate : num 0.0184 0.1172 0.0501 0.0611 0.0957 ...
## $ Median : int 110000 75000 70000 65000 62000 62000 60000 60000 60000 60000 ...
## $ P25th : int 95000 55000 43000 50000 53000 31500 48000 45000 42000 36000 ...
## $ P75th : int 125000 90000 80000 75000 72000 109000 70000 72000 70000 70000 ...
## $ College_jobs : int 1534 350 529 18314 1768 972 52844 45829 8184 6439 ...
## $ Non_college_jobs : int 364 257 102 4440 314 500 16384 10874 2425 2471 ...
## $ Low_wage_jobs : int 193 50 0 972 259 220 3253 3170 372 789 ...
## $ Over.50K : Factor w/ 2 levels "Over","Under.Equal": 1 1 1 1 1 1 1 1 1 1 ...
## $ High.Unemployment : Factor w/ 1 level "Low": 1 1 1 1 1 1 1 1 1 1 ...
## $ combined_target : Factor w/ 2 levels "LE.EQ.20K","G.50K": 1 1 1 2 2 2 1 1 1 2 ...
## mtry = 4 OOB error = 20.66%
## Searching left ...
## mtry = 2 OOB error = 19.01%
## 0.08 0.05
## mtry = 1 OOB error = 29.75%
## -0.5652174 0.05
## Searching right ...
## mtry = 8 OOB error = 14.88%
## 0.2173913 0.05
## mtry = 16 OOB error = 9.09%
## 0.3888889 0.05
## mtry = 20 OOB error = 11.57%
## -0.2727273 0.05
##
## Call:
## randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1])
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 16
##
## OOB estimate of error rate: 12.4%
## Confusion matrix:
## LE.EQ.20K G.50K class.error
## LE.EQ.20K 65 8 0.1095890
## G.50K 7 41 0.1458333
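The pairs of numbers in the mtry search trace above can be read as the relative improvement in OOB error over the best error so far, followed by the 0.05 improvement threshold. The sketch below reproduces them. The error counts are inferred by us from the printed percentages of the 121 training rows; they are not printed in the report.

```r
n <- 121

# OOB misclassification counts inferred from the percentages in the trace,
# e.g. 25 / 121 = 20.66%. These are our reconstruction, not printed values.
errors <- c(mtry4 = 25, mtry2 = 23, mtry1 = 36, mtry8 = 18,
            mtry16 = 11, mtry20 = 14)
round(100 * errors / n, 2)

# Relative improvement of a new error over the best error so far.
improvement <- function(new, best) 1 - new / best

search_left  <- c(improvement(23, 25), improvement(36, 23))
search_right <- c(improvement(18, 23), improvement(11, 18),
                  improvement(14, 11))
round(c(search_left, search_right), 7)

# Final forest: error rates from its confusion matrix.
final_cm <- matrix(c(65, 7, 8, 41), nrow = 2)
oob_error   <- (final_cm[1, 2] + final_cm[2, 1]) / sum(final_cm)
class_error <- c(final_cm[1, 2] / sum(final_cm[1, ]),
                 final_cm[2, 1] / sum(final_cm[2, ]))
round(c(oob_error, class_error), 4)
```

The search stops in each direction once the improvement falls below the threshold, which is why mtry = 16 is selected.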
Because the built-in random forest model did not work well with the tuning done through the caret library, a custom random forest classification model was defined for caret so that the best values for the three hyperparameters (mtry, sampsize, and ntree) could be determined together.
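A custom model of this kind is passed to caret as a list of components. The definition used for this report is not shown, so the following is only a sketch of the usual shape of such a list, assuming the randomForest package does the fitting; the name `customRF` is ours.

```r
# Sketch of a custom caret model that exposes mtry, sampsize, and ntree.
# Defining the list needs no packages; randomForest is only called on fit.
customRF <- list(
  type = "Classification",
  library = "randomForest",
  loop = NULL,
  # One row per tunable hyperparameter.
  parameters = data.frame(
    parameter = c("mtry", "sampsize", "ntree"),
    class = rep("numeric", 3),
    label = c("mtry", "sampsize", "ntree")
  ),
  # No default grid: the grid is expected to be supplied explicitly.
  grid = function(x, y, len = NULL, search = "grid") NULL,
  fit = function(x, y, wts, param, lev, last, weights, classProbs, ...) {
    randomForest::randomForest(x, y,
                               mtry = param$mtry,
                               sampsize = param$sampsize,
                               ntree = param$ntree, ...)
  },
  predict = function(modelFit, newdata, preProc = NULL, submodels = NULL) {
    predict(modelFit, newdata)
  },
  prob = function(modelFit, newdata, preProc = NULL, submodels = NULL) {
    predict(modelFit, newdata, type = "prob")
  },
  sort = function(x) x[order(x[, 1]), ],
  levels = function(x) x$classes
)

customRF$parameters
```

A list like this would typically be given to `caret::train` through its `method` argument, together with a tuning grid whose column names are the parameter names prefixed with a dot.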
Now we can set the grid of hyperparameter values to try when tuning the model.
## .mtry .sampsize .ntree
## 1 3 50 200
## 2 4 50 200
## 3 5 50 200
## 4 3 100 200
## 5 4 100 200
## 6 5 100 200
## 7 3 200 200
## 8 4 200 200
## 9 5 200 200
## 10 3 50 300
## 11 4 50 300
## 12 5 50 300
## 13 3 100 300
## 14 4 100 300
## 15 5 100 300
## 16 3 200 300
## 17 4 200 300
## 18 5 200 300
## 19 3 50 400
## 20 4 50 400
## 21 5 50 400
## 22 3 100 400
## 23 4 100 400
## 24 5 100 400
## 25 3 200 400
## 26 4 200 400
## 27 5 200 400
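A grid with this layout, where .mtry varies fastest and .ntree slowest, is what base R's `expand.grid` returns for three values of each hyperparameter. A minimal sketch follows; the object name `tune_grid` is ours.

```r
# All 3 x 3 x 3 = 27 combinations; the first argument varies fastest.
tune_grid <- expand.grid(
  .mtry     = c(3, 4, 5),
  .sampsize = c(50, 100, 200),
  .ntree    = c(200, 300, 400)
)

nrow(tune_grid)
head(tune_grid, 4)
```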
## 121 samples
## 19 predictor
## 2 classes: 'Over', 'Under.Equal'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 97, 97, 97, 96, 97, 97, ...
## Resampling results across tuning parameters:
##
## mtry sampsize ntree ROC Sens Spec
## 3 50 200 0.9903810 0.8533333 1.0000000
## 3 50 300 0.9910159 0.8300000 1.0000000
## 3 50 400 0.9910159 0.8266667 1.0000000
## 3 100 200 0.9913333 0.8500000 1.0000000
## 3 100 300 0.9871429 0.8266667 1.0000000
## 3 100 400 0.9910159 0.8300000 1.0000000
## 3 200 200 0.9910159 0.8300000 1.0000000
## 3 200 300 0.9897460 0.8400000 1.0000000
## 3 200 400 0.9903492 0.8400000 0.9980952
## 4 50 200 0.9913333 0.8633333 1.0000000
## 4 50 300 0.9916508 0.8666667 1.0000000
## 4 50 400 0.9910159 0.8533333 1.0000000
## 4 100 200 0.9909841 0.9000000 1.0000000
## 4 100 300 0.9897143 0.8666667 1.0000000
## 4 100 400 0.9916508 0.8766667 1.0000000
## 4 200 200 0.9906984 0.8766667 1.0000000
## 4 200 300 0.9925873 0.8666667 1.0000000
## 4 200 400 0.9929206 0.8533333 1.0000000
## 5 50 200 0.9916508 0.9233333 1.0000000
## 5 50 300 0.9910159 0.8966667 1.0000000
## 5 50 400 0.9922857 0.8966667 1.0000000
## 5 100 200 0.9903810 0.8966667 1.0000000
## 5 100 300 0.9916508 0.8866667 1.0000000
## 5 100 400 0.9916508 0.9100000 1.0000000
## 5 200 200 0.9922857 0.8866667 1.0000000
## 5 200 300 0.9910159 0.8866667 1.0000000
## 5 200 400 0.9916508 0.8633333 1.0000000
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 4, ntree = 400 and sampsize
## = 200.
# Evaluation of Model
What can you say about the results of the methods section as it relates to your question given the limitations to your model?
What additional analysis is needed or what limited your analysis on this project?